Unsupervised Deep Auditory Model Using Stack of Convolutional RBMs for Speech Recognition
نویسندگان
چکیده
Recently, we have proposed an unsupervised filterbank learning model based on Convolutional RBM (ConvRBM). This model is able to learn auditory-like subband filters using speech signals as an input. In this paper, we propose two-layer Unsupervised Deep Auditory Model (UDAM) by stacking two ConvRBMs. The first layer ConvRBM learns filterbank from speech signals and hence, it represents early auditory processing. The hidden units’ responses of the first layer are pooled as shorttime spectral representation to train another ConvRBM using greedy layer-wise method. The ConvRBM in second layer trained on spectral representation learns Temporal Receptive Field (TRF) which represent temporal properties of the auditory cortex in human brain. To show the effectiveness of the proposed UDAM, speech recognition experiments were conducted on TIMIT and AURORA 4 databases. We have shown that features extracted from second layer when added to filterbank features of first layer performs better than first layer features alone (and their delta features as well). For both databases, our proposed two-layer deep auditory features improve speech recognition performance over Mel filterbank features. Further improvements can be achieved by system-level combination of both UDAM features and Mel filterbank features.
منابع مشابه
شبکه عصبی پیچشی با پنجرههای قابل تطبیق برای بازشناسی گفتار
Although, speech recognition systems are widely used and their accuracies are continuously increased, there is a considerable performance gap between their accuracies and human recognition ability. This is partially due to high speaker variations in speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
متن کاملSpeech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
متن کاملRegularization for Unsupervised Deep Neural Nets
Unsupervised neural networks, such as restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), are powerful tools for feature selection and pattern recognition tasks. We demonstrate that overfitting occurs in such models just as in deep feedforward neural networks, and discuss possible regularization methods to reduce overfitting. We also propose a “partial” approach to improve the...
متن کاملUnsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks...
متن کاملEstimation of Hand Skeletal Postures by Using Deep Convolutional Neural Networks
Hand posture estimation attracts researchers because of its many applications. Hand posture recognition systems simulate the hand postures by using mathematical algorithms. Convolutional neural networks have provided the best results in the hand posture recognition so far. In this paper, we propose a new method to estimate the hand skeletal posture by using deep convolutional neural networks. T...
متن کامل